Skip to content

refactor: pipeline RNG convention + warnings.warn#34

Merged
shaypal5 merged 3 commits into
mainfrom
refactor/pipelines-rng-and-warnings
Apr 30, 2026
Merged

refactor: pipeline RNG convention + warnings.warn#34
shaypal5 merged 3 commits into
mainfrom
refactor/pipelines-rng-and-warnings

Conversation

@shaypal5

Copy link
Copy Markdown
Contributor

Summary

  • Add RNGRoot.numpy_child() method for numpy RandomState substreams, bridging the project's RNG convention with pandas/numpy stochastic operations
  • Pipeline functions (build_v5, build_v6) now accept seed: int instead of np.random.RandomState, deriving per-step RNGs via RNGRoot internally
  • Replace print(..., file=sys.stderr) with warnings.warn() in library code; CLI scripts keep print() for user-facing progress
  • Update all tests: capsyspytest.warns(), pass seed= kwarg

Closes #31, closes #32

Test plan

  • 740 tests pass (3 new for numpy_child, updated v5/v6 pipeline tests)
  • Determinism preserved: same seed → same output verified by existing parametrized tests
  • ruff check + ruff format clean
  • Warning tests use pytest.warns(UserWarning, match=...) for specificity

🤖 Generated with Claude Code

- Add RNGRoot.numpy_child() for numpy RandomState substreams
- Pipeline functions (build_v5, build_v6) accept seed: int instead of
  np.random.RandomState, deriving per-step RNGs via RNGRoot internally
- Replace print(..., file=sys.stderr) with warnings.warn() in library
  code; CLI scripts keep print() for user-facing progress messages
- Update tests: capsys → pytest.warns(), pass seed= kwarg
- Add 3 tests for numpy_child()

Closes #31, closes #32

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 21:35
@shaypal5 shaypal5 added type: refactor Code change with no behavior difference layer: core core/ primitives (RNG, IDs, models, exceptions) layer: pipelines pipelines/ — dataset build transformations labels Apr 30, 2026
@github-actions

This comment has been minimized.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aligns the v5/v6 dataset pipeline RNG interface with the repo’s RNGRoot convention and replaces stderr print() warnings in library code with warnings.warn(), while updating scripts/tests to the new seed-based API.

Changes:

  • Add RNGRoot.numpy_child(name) to provide deterministic NumPy/Pandas-compatible RNG substreams.
  • Refactor v5/v6 pipeline functions to accept seed: int (instead of np.random.RandomState) and derive per-step RNGs internally via RNGRoot.
  • Replace print(..., file=sys.stderr) with warnings.warn(..., stacklevel=2) in pipeline code and update tests to assert warnings with pytest.warns().

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
leadforge/core/rng.py Adds RNGRoot.numpy_child() for deterministic NumPy/Pandas RNG substreams.
leadforge/pipelines/build_v5.py Switches pipeline steps to seed-based RNG derivation and replaces stderr prints with warnings.warn().
leadforge/pipelines/build_v6.py Same refactor as v5, including seed-based RNG derivation and warning emission.
scripts/build_v5_snapshot.py Updates CLI orchestration to pass seed to pipeline functions (keeps CLI stderr progress prints).
scripts/build_v6_snapshot.py Same as v5 script: pass seed to pipeline functions and remove unused NumPy RNG.
tests/core/test_rng.py Adds tests for numpy_child() determinism/independence.
tests/scripts/test_build_v5_snapshot.py Updates tests to pass seed= and assert warnings via pytest.warns() instead of capsys.
tests/scripts/test_build_v6_snapshot.py Updates tests to pass seed= across v6 pipeline steps.
.agent-plan.md Documents completion checklist for the refactor work.
Comments suppressed due to low confidence (1)

leadforge/pipelines/build_v6.py:173

  • The assign_acquisition_wave docstring says the noise is added “at the boundaries”, but the implementation applies ~5% noise across all rows (noise_mask = rng.random(n) < 0.05). Either constrain the noise mask to boundary regions or update the docstring so it matches the actual behavior.
    """Assign acquisition_wave (A, B, C) based on lead index position.

    Waves A/B/C are roughly chronological: first third = A, middle = B,
    last third = C. A small amount of noise is added at the boundaries.
    """
    rng = RNGRoot(seed).numpy_child("acquisition_wave")
    df = df.copy()
    n = len(df)
    waves = np.empty(n, dtype=object)
    third = n // 3

    # Base assignment by position (stable across seeds)
    waves[:third] = "A"
    waves[third : 2 * third] = "B"
    waves[2 * third :] = "C"

    # Add ~5% boundary noise so it's not perfectly deterministic
    noise_mask = rng.random(n) < 0.05
    noise_vals = rng.choice(["A", "B", "C"], size=n)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread leadforge/core/rng.py Outdated
… import

- Regenerate v6 CSVs to match new per-function RNG derivation
  (old CSVs were stale after RNG refactor changed random streams)
- Re-validate: all mandatory checks pass with updated metrics
- Update RELEASE_v6.md with new baseline numbers
- Fix misleading numpy_child() docstring: clarify it uses 4-byte
  truncation (not 8-byte like child()), producing independent streams
- Make numpy import lazy in core/rng.py to keep the module importable
  without numpy as a hard dependency
- Document RandomState-vs-Generator choice in docstring (migration
  deferred, tracked separately)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

This comment has been minimized.

CI ruff 0.15.12 enforces PT030: pytest.warns() must have match parameter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 21:42
@github-actions

Copy link
Copy Markdown

pr-agent-context report:

This run includes an unresolved review comment on PR #34 in repository https://github.com/leadforge-dev/leadforge

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: leadforge/core/rng.py
URL: https://github.com/leadforge-dev/leadforge/pull/34#discussion_r3170991378
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    `numpy_child()` docstring says it uses the same derivation as `child()`, but the implementation re-hashes and truncates to 4 bytes (vs `child()` truncating to 8). To keep RNG derivation consistent across the repo, consider deriving the numpy seed from `self.child(name)` (e.g., `getrandbits(32)`/`getrandbits(64)` then adapt to RandomState) and update the docstring to reflect the actual truncation/derivation rules.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25190774651 attempt 1
Comment timestamp: 2026-04-30T21:42:55.068652+00:00
PR head commit: ffabbcb2c0cfdf5cd2c8ac600c0fe17266e036eb

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aligns the dataset pipeline code with the project’s RNG conventions and replaces stderr print() warnings in library code with warnings.warn(), while updating scripts and tests accordingly.

Changes:

  • Added RNGRoot.numpy_child(name) to produce deterministic NumPy RandomState substreams for pandas/numpy operations.
  • Refactored v5/v6 pipeline step functions to accept seed: int and internally derive named RNG substreams via RNGRoot.
  • Replaced print(..., file=sys.stderr) warnings in library pipeline steps with warnings.warn(..., stacklevel=2) and updated tests to assert warnings with pytest.warns().

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
leadforge/core/rng.py Adds numpy_child() for deterministic NumPy RNG substreams derived from the root seed.
leadforge/pipelines/build_v5.py Pipeline steps now take seed and use warnings.warn() instead of stderr prints.
leadforge/pipelines/build_v6.py Same refactor as v5, including per-step named numpy RNG substreams.
scripts/build_v5_snapshot.py Script now passes seed into pipeline steps; removes local RandomState.
scripts/build_v6_snapshot.py Script now passes seed into pipeline steps; removes local RandomState.
tests/core/test_rng.py Adds tests covering numpy_child() type + determinism + name independence.
tests/scripts/test_build_v5_snapshot.py Updates calls to pass seed= and switches stderr assertions to pytest.warns().
tests/scripts/test_build_v6_snapshot.py Updates calls to pass seed= for all stochastic pipeline steps.
lead_scoring_intro/RELEASE_v6.md Updates reported metrics/uplift values for v6 release notes.
.agent-plan.md Updates plan/checklist with the new pipeline RNG + warnings refactor and new metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +105 to +109
| ROC-AUC | 0.667 |
| PR-AUC | 0.429 |
| Base rate | 30.0% |
| Precision@25 | 0.480 (Lift: 1.60x) |
| Precision@50 | 0.420 (Lift: 1.40x) |
| Precision@25 | 0.520 (Lift: 1.73x) |
| Precision@50 | 0.480 (Lift: 1.60x) |
Comment on lines 113 to +116
| Model | Mean AUC | vs LR |
|---|---|---|
| Logistic Regression | 0.658 | — |
| GBM (100 trees) | 0.680 | +0.022 |
| Logistic Regression | 0.627 | — |
| GBM (100 trees) | 0.643 | +0.016 |
| 50 | $1,379,208 | $1,949,380 | +41.3% |
| 25 | $1,203,430 | $1,380,990 | +14.8% |
| 50 | $1,809,281 | $2,014,459 | +11.3% |

@shaypal5 shaypal5 merged commit 4b2f74b into main Apr 30, 2026
11 checks passed
@shaypal5 shaypal5 deleted the refactor/pipelines-rng-and-warnings branch April 30, 2026 21:54
shaypal5 added a commit that referenced this pull request May 1, 2026
* refactor: wire sample_hidden_graph RNG through RNGRoot

Change sample_hidden_graph() to accept an RNGRoot instance instead of a
bare int seed, aligning it with the cross-component RNG convention
established in PR #34. The old int-seed interface is preserved with a
deprecation warning for backward compatibility.

Closes #9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: drop backward-compat shim, clean up signature

Address self-review feedback:
- Remove unnecessary deprecation warnings — sample_hidden_graph is
  internal, no external consumers need backward compat
- Make rng_root a required RNGRoot parameter (no union type)
- Add strict TypeError for non-RNGRoot input
- Create RNGRoot once in generator.py rather than inline
- Replace deprecation tests with proper type-error tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

layer: core core/ primitives (RNG, IDs, models, exceptions) layer: pipelines pipelines/ — dataset build transformations type: refactor Code change with no behavior difference

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pipelines: align RNG interface with RNGRoot convention pipelines: replace stderr prints with warnings.warn in library functions

2 participants